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Abstract. The block Kaczmarz method is an iterative scheme for solving overdetermined least-squares 
problems. At each step, the algorithm projects the current iterate onto the solution space of a subset of 
the constraints. This paper describes a block Kaczmarz algorithm that uses a randomized control scheme 
to choose the subset at each step. This algorithm is the first block Kaczmarz method with an (expected) 
linear rate of convergence that can be expressed in terms of the geometric properties of the matrix and its 
submatrices. The analysis reveals that the algorithm is most effective when it is given a good row paving 
of the matrix, a partition of the rows into well- conditioned blocks. The operator theory literature provides 
detailed information about the existence and construction of good row pavings. Together, these results yield 
an efficient block Kaczmarz scheme that applies to many overdetermined least- squares problem. 



1. Introduction 

The Kaczmarz method | Kac37 1 is an iterative algorithm for solving overdetermined least-squares prob- 
lems. Because of its simplicity and performance, this scheme has found application in fields ranging 
from image reconstruction to digital signal processing ISS87I |CFM + 92| [FS951 INatOll . At each iteration, 
the basic Kaczmarz method makes progress by enforcing a single constraint, while the block Kaczmarz 
method |Elf80 | enforces many constraints at once. This paper introduces a randomized version of the 
block Kaczmarz method that converges with an expected linear rate, and we characterize the perfor- 
mance of this algorithm using geometric properties of the blocks of equations. This analysis leads us to 
consider the concept of a row paving of a matrix, a partition of the rows into well-conditioned blocks. We 
summarize the literature on row pavings, and we explain how this theory interacts with the block Kacz- 
marz method. Together, these results yield an efficient block Kaczmarz scheme that applies to many 
overdetermined least-squares problems. 

1.1. Standing Assumptions and Notation. Let A be a real or complex n x d matrix with full column rank, 
and suppose that b is a vector with dimension n. Consider the overdetermined least-squares problem 

minimize HAx-fol^. (1.1) 

The symbol \\-\\ p refers to the £ p vector norm for p e [l,oo]. We write x* for the unique minimizer of 
and we introduce the residual vector e := Ax* - b. 

To streamline our discussion, we will often assume that each row a\ of the matrix A shares the same 
£ 2 norm: 

||flj|l2 = l for each i- (1.2) 

We say that A is standardized when (1.2) is in force. The spectral norm is denoted as ||-||, while IHIf 
represents the Frobenius norm. When applied to an Hermitian matrix, the maps A m i n and A max return 
the algebraic minimum and maximum eigenvalues. For apx q matrix W> we arrange the singular values 
as follows. 

o-max(W) :=<xi(W0 >o 2 (W) > • • • > a m m{p,q} (W) =:a min (W). 
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The minimum singular value is positive if and only if WW* or W* W is nonsingular. We define the con- 
dition number k{W) := cr max ( W) I cr m i n ( W) . The dagger t denotes the Moore-Penrose pseudoinverse. 
When W has full row rank, its pseudoinverse is determined by the formula := W* (WW*) -1 . 

1.2. The Simple Kaczmarz Method. The Kaczmarz method is an iterative algorithm that produces an 
approximation to the minimizer x* of the least-squares problem (1.1) . The method commences with an 
arbitrary guess Xo for the solution. At the jth iteration, we select a row index t = t(j) of the matrix A> and 
we project the current iterate Xj-\ onto the solution space of the equation (a t} x) = b t . That is, 

b t -(a t , Xj-i) 

x j = x J-i + a t (L3) 

ll%ll 2 

This process continues until it triggers an appropriate convergence criterion. 

To develop a complete algorithm, we also need a control mechanism that specifies how to select rows. 
For example, the most classical approach cycles through the rows in order. Instead, we focus on a mod- 
ern formulation that uses a randomized control mechanism. Randomization has several benefits: the 
resulting algorithm is easy to analyze, it is simple to implement, and it is often effective in practice. 

Our primary reference is the randomized Kaczmarz algorithm recently proposed by Strohmer and Ver- 
shynin |SV09 |. When A is standardized, their method operates as follows. At iteration j, independently 
of all previous random choices, the algorithm draws the row index t(j) uniformly at random from the 
set {1, ... } n} of all row indices. Then the current iterate is updated using the rule (1.3) . The paper |SV09 | 
provides a short, elegant proof that this iteration converges at an expected linear rate to the solution x* 
of a consistent least-squares problem (i.e., where the residual e is zero). 

Needell |NeelO | has extended the argument of |SV09 | to the case of an inconsistent least-squares prob- 
lem. For a standardized matrix A> NeedelFs error estimate reads 



nxj-xAl < 



le" 2 



x 0"**ll2 + 2 777 - (L4) 

<in^ 



n 

In words, the randomized Kaczmarz method converges in expectation at a linear rat^juntil it reaches a 
fixed ball about the true solution x*, at which point the error may cease to decay. The radius of this ball 
roughly equals the second term in (1.4) , while the convergence rate is controlled by the bracket. When 
the residual e is zero, the bound (1.4) reduces to the error estimate from ISV09I . 



1.3. The Block Kaczmarz Method. In some situations IEHL81I . practitioners prefer to use a block ver- 
sion of the Kaczmarz method to solve the least-squares problem (1.1) . We consider a formulation due 
to Elfving IElf80l . This procedure begins with an initial guess Xo for the solution. At each iteration j, we 
select a subset r = r(j) of the row indices of A> and we project the current iterate x/_i onto the solution 
space of A T x = b T , the set of equations listed in t. That is, 

Xj = xj-i + (A T )\b T - A T Xj-i). (1.5) 

This process continues until it has converged. We have written A T for the row submatrix of A indexed 
by r, while b T is the subvector of b with components listed in r. We assume that the row submatrix A T 
is fat (or square), so the pseudoinverse (1.5) returns the solution to an underdetermined least-squares 
problem. 

To specify a block Kaczmarz algorithm, one must decide what blocks of indices are permissible, as well 
as the mechanism for selecting a block at each iteration. In this paper, we study a version that is based 
on two design decisions. First, this algorithm requires a partition T = {ri,...,r m } of the row indices of 
A. The method only considers blocks of indices that appear in the partition T. Second, we use a simple 
randomized control scheme to choose which block to enforce. At each iteration, independently of all 



Mathematicians often use the term exponential convergence for the concept numerical analysts call linear convergence. 
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Input: 

• Matrix A with dimension nx d 

• Right-hand side b with dimension n 

• Partition T = {Ti, . . . , r m } of the row indices {1, . . . , n} 

• Initial iterate Xo with dimension d 

• Convergence tolerance e > 

Output: An estimate x for the solution to min x || Ax - b\\\ 



repeat 

Choose a block r uniformly at random from T 

Xj «- + (Ajj^foj - A T Xy_i) { Solve least-squares problem } 

until||AK 7 --fo||2<£ 2 



previous choices, we draw a block r uniformly at random from the partition T. These decisions lead to 
AlgorithmfTT] 

We postpone a detailed discussion on implementation to Section[5J 

1.4. Desiderata for the Partition. The implementation and behavior of the block Kaczmarz method de- 
pend heavily on the properties of the submatrices A T indexed by the blocks r in the partition T. Let 
us explain how the structure of the submatrices plays a role in the implementation; the claims about 



performance will emerge from our theoretical analy sis in Section 1.5 



The most expensive (arithmetic) step in Algorithm 1.1 occurs when we apply the pseudoinverse a\ to a 



vector. We can perform this calculation efficiently provided that each submatrix A T has well-conditioned 
rows. Indeed, in this case, we can invoke an iterative least-squares solver |Bjo96| , such as CGLS, to apply 
the pseudoinverse a\ approximately using a small number of matrix-vector multiplies with A T and A* . 

This observation highlights how important it is to control the geometric properties of the submatrices 
A T induced by T. Let us make a definition that encapsulates the information that we will need. 

Definition 1.1 (Row Paving). An (m, a, /3) row paving of a matrix A is a partition T - {t\ } . . . , r m } of the 
row indices that verifies 

a < A min [A T A* ) and A max {A T A * ) < p for each reT. 

The number m of blocks is called the size of the paving. The numbers a and /3 are called lower and upper 
paving bounds. The ratio fi/a gives a uniform bound on the squared condition number k 2 {A t ) for each 
r. Note that a = unless each submatrix A T is fat (or square). 

Every partition T of the rows of a matrix A has associated paving parameters (m, a, /3). In a moment, 
we will see how these quantities play a role in the performance of the algorithm. Roughly speaking, it 
is best that the size m, the upper bound /3, and the conditioning /3/ a of the paving are small. Later, in 
Sections [T77| and|3] we will discuss what kind of bounds we can expect on the paving parameters, as well 
as computational methods for producing good pavings. 

1.5. Convergence of Randomized Block Kaczmarz. The main result of this paper provides information 
about the convergence properties of the randomized block Kaczmarz method, Algorithm |l.l| in terms of 
the parameters of the row paving T. 
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Theorem 1.2 (Convergence). Suppose A is a matrix with full column rank that admits an (m, a, /3) row 
paving T. Consider the least-squares problem 

minimize || Ax - b \\ \ . 

Letx* he the unique minimizer, and define the residual e := Ax* - b. For any initial estimate xo, the ran- 
domized block Kaczmarz method, Algorithm \l.l\ produces a sequence {Xj : j > 0} of iterates that satisfies 



E\\Xj-x*\\ 2 2 



_ mm 



/3m 



o B \\e\\l 
\xo-xA 2 2 + - • , f . (1.6) 



Turn to Section[2]for the proof of Theorem |1.2| 

The expression l |1.6) states that the block Kaczmarz method exhibits an expected linear rate of conver- 
gence until it reaches a ball about the true solution. The radius of this ball, which we call the convergence 
horizon, is comparable with the second term on the right-hand side of (1.6) . The bracket controls the 
convergence rate. The minimum singular value of A affects both the rate of convergence and the con- 
vergence horizon. In each case, we prefer a m m(A) to be as large as possible. 

The properties of the row paving play an interesting role in Theorem |1.2| Curiously, the rate of con- 
vergence depends only on the upper paving bound /3 and the number m of blocks in the paving. On the 
other hand, the convergence horizon reflects the conditioning f>l a of the paving. Thus, the condition- 
ing of the paving only affects the error bound when the least-squares problem is inconsistent (i.e., e is 



nonzero). Nevertheless, as Section 1.4 suggests, we usually want the paving to be well conditioned to 



ensure that we can apply the block update rule (1.5) efficiently. 

1.6. Simple Kaczmarz versus Block Kaczmarz. Suppose A is a standardized matrix with an (m, a, p) row 

paving T. Let us compare the simple Kaczmarz algorithm with uniformly random control (Section [L2) 
to the block Kaczmarz method, Algorithm |l.l| Both methods satisfy an error bound of the form 

E||x,-x*|l! < e-^.||jcb — jcwlll + h 

where the convergence rate p and the convergence horizon h depend on the choice of algorithm. 
First, we compare the convergence rates of the two methods. The bounds (1.4) and (1.6) imply 

Psimp > and pbiock ^ — 7. (1.7) 

F n /3m 

because log(l - t) < -t when t < 1. In other words, the simple method requires a factor nl{f>m) more 
iterations than the block method to achieve the same reduction in error. 

Next, we examine the convergence horizons. The bounds (1.4) and (1.6) yield 



Their ratio satisfies 



^simp = — 7 — — and /Zbiock = 2 — TV,' 

a z . (A) a a z . (A) 

mm v 7 mm v 7 



ftbiock = )B Ml < P 
hsimp a nWeWlo ~ a' 



We see that the convergence horizon ^biock for the block method never exceeds /z S imp by a factor larger 
than the conditioning f>l a of the row paving. But //block m ay be substantially smaller than /z S imp if the 
components of the residual vector are highly nonuniform. 

To make more detailed claims about the relative merits of the two algorithms, we describe two situa- 
tions where we have additional information about the structure of the matrix A, or lack thereof. 
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1.6.1. Example 1: Fast Updates. First, we consider the case where the computational cost of the block 
update rule (1.5) is roughly comparable with the cost of the simple update rule (1.3) . This situation can 
occur when 

• Each submatrix A T admits a fast multiply; and 

• Each submatrix A T is well conditioned. 



See Section [4.3.1| for a numerical example where these properties hold. 

In this setting, one iteration of the block method has roughly the same cost as one iteration of the stan- 
dard method. As a consequence, the comparison (1.7) of the convergence rates provides a reasonable 
assessment of how much each algorithm reduces the error per unit of arithmetic. We see that the block 
algorithm really is about nl{f5m) times faster than the simple Kaczmarz method. When /3m is small in 
comparison with n> this represents a massive acceleration. 

1.6.2. Example 2: Unstructured Submatrices. On the other hand, suppose that each submatrix A T is un- 
structured and dense. Then the block method may involve much more arithmetic per iteration than the 
simple method. As a consequence, the comparison (1.7) is unfair to the simple method. 

In this case, it is more appropriate to examine the convergence rate per epoch, the minimum number 
of iterations it takes the algorithm to touch each row of A once. For the simple Kaczmarz method, an 
epoch consists of n iterations; for the block method, an epoch consists of m iterations. In this setting, 
each algorithm requires about the same amount of arithmetic in one epoch, so we consider the per- 
epoch convergence rates: 

m ■ Pbiock ^ — - — and n • p simp > a min (A) . 

We see that, in theory, the per-epoch convergence of the block method is worse, and the disadvantage 
increases with the upper bound /3 on the paving. 

In practice, the block method displays better convergence behavior than this estimate suggests. The 
block method accrues further advantages because of subtle computational issues involving data transfer 



and basic linear algebra subroutines (BLASx). We discuss these points in Section 4.1 and we provide 



some numerical support in Section 4.3.2 



1.7. Existence of Good Pavings. So far, we have assumed that the matrix A comes packaged with a nat- 
ural row paving T. For some of the applications we have in mind, this hypothesis is reasonable. Nev- 
ertheless, the block Kaczmarz method would be more versatile if we could construct row pavings for a 
broad class of matrices. To that end, Popa |Pop99 | has developed an approach for producing a paving 
of a sparse matrix; see also IPop OTj |Pop04| . But we can travel much farther down this road. It is an 



astonishing fact that every standardized matrix admits a good row paving. 

Proposition 1.3 (Existence of Good Row Pavings). Fix a number 8 e (0, 1). Let A be a standardized matrix 
with n rows. Then A admits a row paving whose parameters satisfy 

m < Cp aV e •5~ 2 ||A|| 2 log(l + n) and 1-8 <a< (3<l + 8. 

The number C pave is a positive, universal constant. 



Proposition! 1.3|follows most directly from a recent result |Tro09, Thm. 1.2], whose provenance can be 



traced to the celebrated papers IBT871 IBT9T1 . Although Proposition [O] is only an existential result, the 
literature describes several efficient algorithms for constructing row pavings. In particular, under some 
additional conditions, it is possible to pave a matrix by partitioning its rows at random. See Section|3]for 
more results and background on paving. 
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1.7.1. Understanding the Paving Theorem. Before we continue, let us take a moment to explain some 



of the key aspects of Proposition [L3j The main point is that the size of the paving depends only on the 
spectral norm of the matrix — not on the smallest singular value. As a consequence, it is possible to pave 
matrices with substantial null spaces! 

A second point is that we can make the squared conditioning f>l a as close to one as we desire. In par- 
ticular, the choice 5 = 0.5 yields /3/a < 3. This property licenses us to apply an iterative algorithm to solve 



least-squares problems involving the submatrices induced by the paving, as described in Section [L4} We 
remark that the dependence of the paving size m on the parameter 8 is optimaQas 5^0. 



Next, we develop some intuition about the role of the spectral norm in Proposition ! 1.3| Suppose that 
A is a standardized matrix with n rows. If the lower paving bound a > 0, then a row paving T of A must 
contain at least ft/rank(A) blocks. Otherwise, A T is rank deficient for some r e T simply because it has 
more than rank(A) rows. Now, using the fact || A\\^ < rank(A) ||A|| 2 , we easily verify that 



o n 
\A\\ 2 > 



rank(A) 

This bound is sharp. (Consider the two extreme examples: a matrix with orthonormal rows and a ma- 
trix with identical rows.) Therefore, we can view the squared spectral norm as a proxy for the minimal 
number of blocks in a row paving whose lower bound a > 0. 



We conclude that Proposition 1 1 .3| delivers a row paving whose size falls within a logarithmic factor 
of optimal. One may wonder whether it is possible to remove the logarithm. For general matrices, this 
question remains open. It is known IAnd79l [BHKW881 IBT9 1 1 that an affirmative answer would imply the 
long-standing conjecture of Kadison and Singer IKS59I . 

1.8. Paved with Good Intentions. We conclude the Introduction by merging our theorem on the con- 
vergence of the block Kaczmarz method with the result on the existence of pavings. 

Corollary 1.4 (Block Kaczmarz with a Good Row Paving). Suppose that A is a standardized matrix with 



full column rank. Let T be a good row paving of A, as guaranteed by Proposition 1.3 with S = 112. Under 



the notation ofTheorem \1.2\ the block Kaczmarz method, Algorithm \ 1 .l\ admits the convergence estimate 

1- 



E||x 7 --x*|| 2 < 



6C pave K 2 (A)log(l + rc) 



1,2 3 ^ll2 
l|Xb-**ll2 + ~~ 2 



where C pave is the constant from Proposition 1.3 



To summarize, Proposition 1 1 .3| yields a small paving of the matrix A with exceptional conditioning. 
With this choice of paving, we can perform the block Kaczmarz update (1.5) quickly using an iterative 
least-squares algorithm. (Indeed, to apply a\ with a fixed level of precision, it suffices to perform a con- 
stant number of matrix-vector multiplies with A T and A* .) Furthermore, we see that the block Kaczmarz 
method converges linearly with a rate that is controlled by the condition number of the matrix A, and the 
convergence horizon is on the same order as the size of the residual. This is essentially the best outcome 
one might hope for. 

1.9. Organization. The rest of the paper has the following structure. Section [2] contains a proof of the 
main result, Theorem |1.2| In Section|3j we give an overview of the literature on pavings. Section [i] dis- 



cusses numerical aspects of the block Kaczmarz method. We discuss related work on Kaczmarz methods 
and future directions in Section[5] Finally, Appendix|X|offers a proof of a supplemental result. 



2 To verify this point, consider a large matrix A whose entries have equal magnitude and independent random signs. Use the 
Bai-Yin Law IBY93I Thm. 2] to estimate singular values and norms. 
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2. Analysis of the Randomized Block Kaczmarz Algorithm 

This section contains the proof of the main result, Theorem 1.2 on the convergence of Algorithm |l.l| 
We commence with two simple lemmas. The first step provides a deterministic bound on how much one 
iteration of the algorithm reduces the error. 

Lemma 2.1. Instate the hypotheses and notation ofTheorem \1.2\ Then the error at iteration j satisfies the 
deterministic bound 

Uj-x^l^ni-^ArKxj^-x+nj + hi Mi 

where r = r{j) is the block selected at iteration j. 



Proof. According to the update rule (1.5) , block Kaczmarz computes 

Xj = Xj-i + A\ [b T - A T Xj-i) = Xj-i + A\A T {x+ - Xj-i) - A\e T , 

where we have introduced the decomposition b = Ax* - e> restricted to the coordinates listed in r. Sub- 
tract x* from both sides to obtain 

Xj-x± = (I- A\A T ){X]-i-x*)- A\e T . 

The range of a\ and the range of I - AlA T are orthogonal, so we may invoke the Pythagorean Theorem 
to reach 

||x 7 --x^|ll = ||a-AjA T )(x ; -_i-x^)||| + |l4e T |ll. 
The second term on the right-hand side satisfies 

\\Ale r \\ 2 2 < ex 2 maK i4)\\e r \\l < \ \\e T \\ 2 2 <-\\e T \\l 

where a is the lower bound on the row paving T. Combine the last two displays to wrap up. □ 

The second lemma gives us a means to average the two quantities appearing in Lemma [2J] over a 
random choice of the block r = t(j). 

Lemma 2.2. Instate the hypotheses and notation ofTheorem \1.2\ Suppose thatr is chosen uniformly at 
random from the row paving T. For fixed vectors u and v, it holds that 



.,2 

ehi-aIa t )u\\ 2 2 



u\\ 2 2 and E||i/^ = -H|i/||i 



(3 m 

Proof The second identity emerges from a very short calculation: 

Eii^iii = -E wer ii^ii^ = -ii^iii, 

which depends on the fact that the blocks of T partition the components of v. 

Since a\a t is an orthogonal projector, we may apply the Pythagorean Theorem to obtain the relation 

E||(I-4a t ) M ||2 = || M ||2-E||4a tM ||2. 
We control the remaining expectation as follows. 



E||4A TM ||2>E[^ in (Aj)||A TM ||2 



>\-E\\A T u\\ 2 2 



P 



1 i a 2 {A) 
= UcouW 2 = —\\Au\\ 2 > \\u\\ 2 . 

The second inequality depends on the bound o^inC^ = ^max^r) - P 1 - The fourth relation holds 
because the blocks in a paving partition the row indices of A. To complete the proof, we simply combine 
the last two displays. □ 
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The main result follows quickly once we merge the two lemmas. 

Proof of Theorem \h^ First, we bound the expected error at iteration j in terms of the error at iteration 
j - 1. Average the bound from Lemma 2.1 over the randomness in r = r(y) to reach 

E T \\xj - x* \\ 2 2 < E T || (I - 4 A T ) [xj-i - x*) 111 + ^ • E T || e T \\j 



/3 m 



\\ X] _ l - Xi< \\l + — \\e\\l 
J z am z 



The second inequality follows from Lemma |22) with u = Xj-\ - and v-e. 

By applying this result repeatedly, we can control the expected error after j iterations in terms of the 
initial error. Abbreviating y : = 1 - o-^ in {A) I ( /3 m) , we obtain the estimate 

E\\Xj - x* ||| = E T (i)E T( 2) • • • E T(y -) || x ; - - x* \\l 

^r'li^-^iii + ^ikiiifLUr') 



< r ;'||x -x*||! + ^ rllelli- 

ara(l-y) 

Reintroduce the value of y in this expression, and simplify to complete the proof. □ 

3. A Conversation about Pavings 

As we have seen, the properties of the row paving T of the matrix A have a significant effect on the 
behavior of the block Kaczmarz method, Algorithm |1.1| It is natural to ask if every standardized matrix 
A admits a good paving, and — if so — how we can exhibit such a paving. 

This section summarizes the main results from the literature on row pavings, with particular attention 
to algorithmic techniques for constructing pavings. We focus on two methods in particular. The first 
approach, which is more general, extracts a well-conditioned row submatrix from A and repeats this 
process until the paving is complete. The second approach, which is more automatic, simply forms a 
random partition of the rows with an appropriate number of blocks. 

For clarity, we only discuss row paving theorems for standardized matrices. For general matrices, it 
is often more natural to consider an alternative definition of a paving where the rows of the matrix are 
reweighted. For the reader's convenience, we include some citations that address the general case. 

Remark (Paving a Square Matrix) . The operator theory literature uses the term paving to refer to a parti- 
tion T = {Ti, . . . , r m } of the coordinates of a square matrix H with a zero diagonal in which each diagonal 
block satisfies the bound 

||H TT ||<(l-tf)||H|| foreachrer 
where the parameter 8 e (0, 1). We can obtain row paving results for a standardized matrix A by applying 
paving results for a square matrix to the hollow Gram matrix H = AA* - 1. This approach generally leads 
to an estimate for the size m of the paving that has an excessive dependence on the spectral norm of A. 

3.1. Subset Selection Theorems. The first approach to paving relies on a type of result called a subset 
selection theorem. This class of result asserts that, under appropriate conditions, a matrix contains a 



(large) set of rows with distinguished geometric properties. Proposition 1 1 .3| depends on the following 
subset selection theorem. 

Proposition 3.1 (Subset Selection). Fix a number 8 e (0, 1). Let A be a standardized matrix with n rows. 
Then there exists a subset r of row indices with the properties 

c 8^n 

|t|> Cs l\ n2 n and l-5<A min (A T ^)<A max (^ T ^)<l + 5. 
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The number c ss is a positive, universal constant. 



Proposition |3.1| follows from |Tro09, Thm. 1.2], once we track the parameter 5 through the proof. 



This result has been attributed to Bourgain and Tzafriri |BT91 1, but we cannot find any reference prior 



to ITrb091 . See Section [3.1.1| for further background. 

Proposition |3 . 1 1 ensures that each standardized matrix contains a large set of well-conditioned rows. 
To construct a paving of a standardized matrix A with row indices { 1 , . . . , n} , we apply this result to identify 
a large subset T\ of the row indices. We apply the same result to the set of remaining rows {1, . . . , n} \ T\ to 
bite off another subset T2, and so forth. After 

m<C pave .<r 2 ||A|| 2 log(l + rc) 



steps, we have exhausted the entire matrix. This argument yields Proposition 



1.3 



The paper |Tro09 | contains an efficient computational method for identifying the subset t promised 



by Proposition 3.1 This algorithm chooses a random set a) of rows from the matrix A with twice the 
cardinality of the desired subset t. Then it computes a matrix factorization of the submatrix A^ that 
exposes a well-conditioned subset r of rows inside a). 

3.1.1. Related Results. The literature contains a variety of subset selection theorems that can be used 
to construct pavings. The first major result in this direction is the Restricted Invertibility Principle of 
Bourgain and Tzafriri | BT87 , Thm. 1 .2] , which guarantees that every standardized matrix contains a set of 
const/ 1| A || 2 rows whose minimal singular value is bounded away from zero. In a similar vein, Kashin and 
Tzafriri |KT94 | establish that every standardized matrix contains a set of const/ 1| A\\ 2 rows whose norm is 
constant. Both results are based on random selection combined with matrix factorization. Neither result 
yields a submatrix whose singular values lie arbitrarily close to one. 

In another notable paper, Bourgain and Tzafriri |BT91 , Cor. 1.2] prove that every standardized matrix 
contains a set of const • S 2 / \\A\\ 4 rows whose squared singular values fall in the range [1 - 5, 1 + 8]. This 
theorem offers quantitative control on the conditioning of the submatrix, but it gives the wrong depen- 



dence on the spectral norm of the matrix. Proposition 3.1 is based on the same type of argument as this 



result of Bourgain and Tzafriri. The iterative argument we use to draw the paving result, Proposition 1.3 



as a corollary of Proposition |3. l| also appears in the paper IBT91I. 

The last few years have witnessed some striking advances in this area. Indeed, Spielman and Srivas- 
tava I SS12I have recently invented an elementary proof of the Restricted Invertibility Principle. Their 
method only involves linear algebra, and it leads to sharp constants. Youssef IYoul21 Thm. 4.2] has 
adapted these ideas to obtain an elementary proof of the Kashin-Tzafriri theorem IKT94I . These results 
are appealing because they construct the required subsets using an algorithmic procedure that admits a 
polynomial-time implementation. See ISrilOllNaollI for further exposition. 

Vershynin | VerOl | has obtained a theory of subset selection for matrices that are not necessarily stan- 
dardized. In his results, the squared Frobenius norm ||A|| 2 plays the role of the number n of rows. Sri- 
vastava's dissertation ISrilOl Chap. 3] contains an algorithm for (weighted) subset selection that applies 
to general matrices. See Youssef 's work IYoul2 l for the latest developments in this direction. 

Finally, let us mention that subset selection theorems have applications throughout mathematics and 
engineering. See the paper |CT06 | for a discussion and references. 

3.2. Randomized Methods for Paving. The second approach to paving has the benefit of utmost sim- 
plicity, but it is more limited in scope. The idea is to divide the rows of the matrix into random blocks 
of approximately equal size. Under additional assumptions, each submatrix induced by this partition 
is likely to be well conditioned. To describe this idea in more detail, we first introduce the concept of a 
random partition. 
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Definition 3.2 (Random Partition). Suppose that n is a permutation on {1, 2, ... , n}, chosen uniformly at 
random. For each i = 1, 2, . . . , m, define the set 

Ti = {7r(/c) : k = L(i - l)n/m\ + 1, L(i - l)n/m\ +2,..., \_in/m\}. 

It is clear that r = {Ti, T2, . . . , T m } is a partition of {1,2,..., n} into m blocks of approximately equal size. 
We say that T is a random partition of {1, 2, . . . , n} into m blocks. 

For every standardized matrix, we can use a random partition to construct a paving whose upper 
bound /3 is relatively small. 

Proposition 3.3 (Random Paving: Upper Bound). Let A be a tall, standardized matrix with n rows. Con- 
sider a randomized partition T of the row indices with m > \\A\\ 2 blocks. Then T is a row paving with 
upper bound ft < 61og(l + n), with probability at least! - n~ l . 



Proposition [33] results from an argument based on the matrix Chernoff inequality ITrol2l Thm. 1.1] 
and a union bound. A model for this type of proof appears in the paper ITrollI . We omit the details. 

In contrast, if we wish to construct a paving with a nontrivial lower bound a, we must place additional 
assumptions on the matrix. An example | BT9 1 , Ex. 2.2] of Bourgain and Tzafriri implies that we must 
assume the rows of the matrix A are weakly correlated to obtain a random paving with a > 0. It is natural 
to carve out a class of matrices that meet this requirement. 

Definition 3.4 (Incoherence). Suppose A is a matrix with rows a\, az, . . a n . We say that A is incoherent 
when 

Cine 



max I (ai, d£)\ 



< 



W log(l + n) 

The number Ci nc is a positive, universal constant. 

Incoherent matrices arise, for example, in signal processing problems | DH01 1 . Every incoherent, stan- 
dardized matrix admits a random paving with controlled lower and upper bounds. 

Proposition 3.5 (Random Paving). Suppose that A is an incoherent, standardized matrix with n rows. Let 
T be a random partition of the row indices into m blocks where m > C ran d-5~ 2 ||A|| 2 log(l + n). Then T is a 
row paving of A whose paving bounds satisfy 1-8 <a< /3 < 1 + 5, with probability at least l-n~ l . 



Proposition|3.5|follows from |Tro08a, Cor. 5.2], along with some standard arguments |BT91, Tro08b|. 



The paper |CD12| contains superior estimates for the constants in this analysis. See Section [3.2. 3| for 
some further background. 

Random paving is a striking idea because it is almost completely automatic. Given a guarantee that 
the matrix A is incoherent and an estimate for the spectral norm, we can obtain a good paving of the 
matrix without any further computation. 

3.2.1. The Fast Incoherence Transform. Prima facie, random paving has limited applicability because the 



incoherence hypothesis in Proposition |3.5| is rather stringent. Fortunately, there is a fast linear transfor- 
mation that can be used to convert any standardized matrix into an incoherent matrix that is nearly 
standardized. 

Definition 3.6 (Fast Incoherence Transform). The fast incoherence transform is the n x n random matrix 
S = FE where F is the n x n unitary discrete Fourier transform (DFT) and E is an n x n diagonal matrix 
whose entries are independent Rademachei^Jrandom variables. 



A Rademacher random variable takes the values + 1 with equal probability. 
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The matrix S is unitary, so it preserves Euclidean structure in a problem. Furthermore, S can be ap- 
plied to a vector in O (ft log ft) arithmetic operations by means of the FFT algorithm. Thus, for an n x d 
matrix A, we can form the product SA with O(ftdlogft) arithmetic operations. When the matrix A is 
sparse or admits a fast matrix-vector multiply, it may be more appropriate to work directly with the fac- 
torized representation SA, because it inherits a fast multiply from its two factors. 

The critical fact is that the fast incoherence transform converts a large class of standardized matrices 
into incoherent matrices that are nearly standardized. 

Proposition 3.7 (Fast Incoherence Transform). Suppose that A is a standardized matrix with n rows 
whose norm satisfies 

||A|| 2 <-^ . (3.1) 

log^(l + ft) 

Then the matrix W = SA satisfies the probability bound 

{ o c inc ) 1 

P<max\(w i} w £ )-8 i£ \> - — ->< -, 

I i,£ log(l + ft) J ft 

where w\ is the i th row ofW and 8[£ is the Kronecker delta. 

We do not have a reference for Proposition |3.7| but the result is probably not new. Appendix|X|offers a 
short proof based on the Hanson-Wright inequality |HW71 1 for Rademacher chaos. 

Let us pause to examine the hypothesis ( 13.1) . Recall that, for a standardized matrix A with n rows, 
the squared spectral norm ||A|| 2 attains its maximal value n when the rows are identical. Therefore, 
the bound ( |3.1) stipulates that the rows of A must exhibit a small amount of diversity. In this case, the 
fast incoherence transform spreads out the diversity evenly so that no pair of columns is correlated too 
strongly. 

3.2.2. Incoherence, Paving, Kaczmarz. This discussion suggests the following approach for solving the 
overdetermined least- squares problem (1.1) with a standardized matrix A: 

(1) Apply the fast incoherence transform S to the objective function: 

\\Ax-b\\l = \\SAx-Sb\\j=: \\Ax-b\\l. 

(2) Draw a random partition T of the rows of A with the number of blocks determined according to 



Proposition 3.5 



(3) Apply the randomized block Kaczmarz method, Algorithm ! l.l| with the matrix A, the right-hand 
side fo, and the paving T to solve the transformed problem. 



Provided that the hypothesis (3.1) is in force, Proposition |3.7| shows that the fast incoherence transform 
is likely to convert the matrix A into an incoherent matrix A. Proposition 3J5 shows that we can obtain 
a good paving of A from a random partition of the rows. If these randomized procedures are successful, 
when we apply the block Kaczmarz algorithm to solve the transformed least-squares problem, we obtain 
the same convergence guarantees outlined in Corollary |1.4| 

3.2.3. Related Results. The idea that randomness might help us to construct a paving is already inherent 
in the Bourgain-Tzafriri paper |BT87 | on the Restricted Invertibility Principle, where they use random- 
ized row selection and matrix factorization to perform subset selection. A result IBT911 Thm. 2.3] in their 
subsequent paper demonstrates that, in the presence of incoherence assumptions, a random partition 
induces a row paving with const - ||i4|| 4 log(l + ft) blocks; the extra factorization step is not necessary. 
Furthermore, under an incoherence assumption slightly stricter than Definition |3.4| they prove that a 
random partition induces a paving with const • ||A|| 4 blocks; no logarithmic factor is necessary. See the 
paper |Tro08b | for a modern proof of the latter result. 
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The aforementioned theorems all yield the wrong dependence on the spectral norm in the size of a 
random row paving. The precis |Tro08a| shows how to obtain the correct quadratic dependence that is 
quoted in Proposition [3^51 If we were to strengthen the incoherence requirement in Proposition [3^51 it 



is likely that we could remove the logarithmic factor from the size of the paving by adapting an argu- 
ment IBT91I Prop. 2.7] of Bourgain and Tzafriri. On the other hand, the logarithmic factor is necessary at 
the incoherence level we have imposed |BT91 , Ex. 2.2]. 

The fast incoherence transform is based on ideas of Ailon and Chazelle |AC09|, who use the random 
matrix S to perform dimension reduction. We believe that the first application of S for randomized linear 
algebra appears in the paper Woolfe et al. IWLRT08I . where they use this transform to aid in computing 
matrix decompositions. See the works |AMT10, HMT 1J3IBG12 1 for further results in this direction. Lib- 
erty's dissertation |Lib09 | describes other randomized maps that can play a similar role. 



4. Numerical Aspects of Block Kaczmarz 
The main goal of this paper is to study the theoretical properties of the block Kaczmarz method, but 



we believe that a short discussion about numerics can also provide some useful insights. In Section|4~T 
we outline some of the situations where we expect the block Kaczmarz method to outperform the sim- 
ple Kaczmarz algorithm. Afterward, in Section [42} we consider some of the questions that arise when 



implementing the block Kaczmarz method. Finally, in Section |4~3] we offer some simple computational 
examples to illustrate our main points. 

4.1. Why Use Block Kaczmarz? It is natural to ask when it might be better to use the block Kaczmarz 
scheme instead of the simple Kaczmarz scheme. The answer to this question depends heavily on the 
specific application at hand, as well as the architecture of the computers on which the algorithm is im- 
plemented. 

First, least-squares problems sometimes involve a matrix that admits a natural row paving. In this 
case, it may be advantageous to exploit the paving algorithmically. For example, certain signal processing 
applications involve multi-sampling schemes, where we collect several batches of uniform time samples 
of a signal. Each sample set produces a set of equations that is easy to solve. The block Kaczmarz method 



provides an effective way to use this structure IFS95I . See Section |4.3.1| for a related numerical example. 

Second, the block Kaczmarz algorithm can be implemented more efficiently than the simple Kacz- 
marz algorithm in many computer architectures. This claim rests on two facts. First, data transfer now 
plays a major role in the cost of numerical algorithms. The block Kaczmarz algorithm is efficient in 
this regard, because it moves a large block of equations into working memory and operates with it for 
some time. In contrast, the simple Kaczmarz method repeatedly transfers new equations into working 
memory. Second, the block Kaczmarz algorithm can exploit high-level basic linear algebra subroutines 
(BLASx). Indeed, inner products dominate the arithmetic in the simple Kaczmarz method, while it is 
possible to implement the block Kaczmarz algorithm using matrix-vector products. As a result, block 
Kaczmarz relies on BLAS2, rather than BLAS1. A more detailed discussion of these issues falls outside the 
scope of this paper. 

4.2. Implementing Block Kaczmarz Methods. The randomized block Kaczmarz method, Algorithm |l.l| 
is easy to describe, but there remain several implementation issues that require attention. 

4.2.1. The Block Update Rule. Most of the arithmetic in the algorithm occurs when we apply the pseu- 
doinverse a\ to a vector in the update (1.51 . Equivalently, we must solve an (underdetermined) least- 
squares problem at each iteration. The appropriate numerical method depends on a wide variety of 
issues |Bj696|, so we cannot give a universal prescription. Here are some factors worth considering. 
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• When the blocks in the paving are very small, a direct method based on QR decomposition or the 
SVD is likely to be fastest. A direct method may also be appropriate when the conditioning /3 la 
of the paving is high and the matrix is dense. 

• In case the conditioning /3/ a of the paving is small, we recommend using an iterative method, 
such as CGLS, LSQR, or the Chebyshev semi-iterative method. These techniques may also be ap- 
propriate for very sparse problems. It is not necessary to run these iterations until they have 
converged fully, and the algorithms will benefit from warm starts provided by the convergence of 
the outer iteration. 

4.2.2. Setting the Convergence Tolerance. Theorem 1 1 .2| implies that Algorithm 1 1 . 1 1 can reduce the error 
|| x - x* || 2 to a level comparable with the convergence horizon. Let us examine how this fact affects our 
choice of the convergence tolerance. 

Combine the convergence criterion \\Ax - b\\\ < e 2 with the decomposition b = Ax* - e to obtain 

\\A(x-x*) + e\\ 2 <£ 2 . 

The residual e is orthogonal to the range of A, so we can apply the Pythagorean theorem and bound the 
£2 norm below to reach 

^ in (A)||x-^||2 + ||^||2< £ 2 . 



It follows that we must set the convergence tolerance e so that 

111- (4.1) 
Otherwise, we have no guarantee that the algorithm will terminate. 



E 2 > 



a 



4.2.3. Checking for Convergence. The convergence criterion \\Ax - b\\ 2 < e 2 maybe somewhat expensive 
to verify because it involves a multiply with the full matrix A. As a consequence, it is better to check for 
convergence only on occasion. For instance, if the paving consists of m blocks, we might only evaluate 
the convergence criterion every m iterations. 

4.2.4. Randomized Cyclic Control. Finally, in practice, the block Kaczmarz algorithm is more effective if 
we use a randomized control scheme different from the one we have analyzed. Recall that Algorithm |l.l| 
samples a uniformly random block at each iteration, independent of all previous choices. Instead, we 
recommend using an alternative scheme that samples blocks without replacement. We can express this 
method formally as follows. 

(1) For each q = 0, 1,2, ... , draw an independent, uniformly random permutation n q on the indices 
{1, . . . , m} of the blocks. 

(2) For each iteration j > 1, decompose j = qm + r> where q and r are nonnegative integers with 
< r < m. Select the block 1(7) = ( r+ i). 

In other words, at each epoch, we cycle through all the blocks in random order. 

In Section [4.3.1| we offer some numerical evidence that this alternative control scheme is more effec- 
tive than the approach used in Algorithm |l.l| At present, compelling explanations for this phenomenon 
are lacking. See IRR1211 for some discussion and conjectures. 

4.3. Numerical Experiments. In this section, we present some numerical experiments to complement 
our discussions about the implementation and theoretical performance of the randomized block Kacz- 
marz method, Algorithm |1.1| In Section [4.3. 1| we consider an example where the block method has a 



clear advantage over the simple method. In Section 4.3.2 we examine some other cases where the ben- 
efits of the block method are more subtle. 
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4.3.1. Submatrices with Fast Multiplies. First, we study the situation where the matrix A comes with a 
natural partition of the rows into well-conditioned blocks, each admitting a fast matrix-vector multiply. 



As we have discussed in Section [T.6.1| the block method has a significant advantage over the simple 
method in this setting. Our experiments bear out this point. 

We build a 300 x 100 matrix A C i rc by stacking 15 partial random circulant matrices: 



A c \rc — 



Cl5 



where Cj = HF*EjV for i = 1,..., 15. 



In this expression, 

• R is the restriction to the first 20 coordinates, 

• F is the 100 x 100 unitary DFT, and 

• Ei is a random diagonal matrix whose entries are independent Rademacher random variables. 
Each Ei is drawn independently from the others. 

This construction ensures that each C; is a section of a circulant matrix with orthonormal rows. As a 
consequence, the pseudoinverse equals the adjoint: cj = C* . Furthermore, we can apply either Q or C* 
to a vector quickl^Jusing one FFT and one inverse FFT, each of length d. 

Using this matrix, we perform a small Matlab experiment to compare the behavior of randomized 
block Kaczmarz and randomized simple Kaczmarz. 

(1) Draw a random matrix A C i TC according to the recipe above, and fix it. Let T be the row partition 
with 15 blocks induced by the circulant structure. 

(2) Let x+ = [1, . . . , 1] *, and form b = Ax*. Set the initial iterate Xo = 0. 

(3) For each of 100 trials, 

(a) Apply the simple Kaczmarz method from Section 1.2 



(b) Apply Algorithm[TT]to produce iterates {x^ lock : j>0 . 



to produce iterates {x. . j > 0}. 



(4) For each algorithm, at each iteration j, compute the minimum, median, and maximum value of 
the approximation error ||x^ lg - x* H2 over the 100 trials. 

In this experiment, we form the matrix A C i rc in the first step and store it. To perform the update ( |1.3| >, 
the simple Kaczmarz method loads the required row from memory without doing any extra computa- 
tion, so the cost of the simple update rule is just 4d complex flops, where d = 100. The block Kaczmarz 
method uses the structure of the circulant blocks, as described in the previous paragraph, to perform the 
update 5) with about 4dlog 2 (d) + 4d complex flops. 

In Figure[l][left panel], we plot the approximation error as a function of the number of flops expended 
by each algorithm. The heavy lines indicate the median error over 100 trials, while the minimum and 
maximum errors describe the boundaries of the shaded region. The block algorithm reduces the error to 
10~ u in about 1.6 x 10 6 complex flops, while the simple algorithm requires about 3.2 x 10 7 complex flops 
to achieve the same result. This amounts to a 20-fold reduction in the amount of arithmetic! 



In Section 4.2.4 we claim that the block Kaczmarz algorithm is more effective when we use an alterna- 
tive control scheme that samples blocks without replacement. To illustrate this point, let us perform an 
experiment with the block circulant matrix using the same methodology as above. Figure[l] [right panel] 
compares the performance of the block Kaczmarz method when we sample blocks with and without re- 
placement. This chart shows that sampling without replacement reduces the amount of computation by 
about 15%. In other experiments, we have seen even more substantial improvements. 



Recall that an FFT or inverse FFT of length d requires roughly dlog 2 id) complex floating-point operations (flops), although 
the precise count depends on arithmetic properties of the number d. 




Figure 1 (Convergence with a Block Circulant Matrix). The matrix A clYC is a fixed 300 x 100 matrix consisting 
of 15 partial circulant blocks. For each algorithm, we chart the approximation error \\Xj - jc* H2 as a function 
of the number of complex flops performed. The heavy line denotes the median error over 100 independent 
trials; the minimum and maximum errors envelop the shaded region. See Section 4.3.1 for details. [Left] Con- 
vergence of the randomized block Kaczmarz method, Algorithm |l.l versus the randomized simple Kaczmarz 
method (Section [L2) . [Right] Convergence of two variants (Section 4.2.4) of the randomized block Kaczmarz 
method. One samples blocks independently with replacement; the other samples blocks without replacement 
in each epoch. 




Figure 2 (Convergence with a Random Matrix). The matrix A Y ^ m is a fixed 300 x 100 matrix whose rows are 
drawn uniformly at random from the Euclidean unit sphere and partitioned into 10 blocks of equal size. For 
the iterates generated by each algorithm, we plot the decay of the approximation error \\xj - H2 as a function 
of three computational resources. The heavy line denotes the median error over 100 independent trials; the 
minimum and maximum errors envelop the shaded region. [Left] Approximation error as a function of the 
number of epochs elapsed. [Center] Approximation error as a function of flops. [Right] Approximation error 
as a function of CPU time. See Section|4.3.2|for details. 



4.3.2. Unstructured Submatrices. Next, we consider some examples where we cannot exploit the struc- 
ture of the submatrices to accelerate the block Kaczmarz method. Although our theory does not yield 
higher rates of convergence, the numerical evidence still suggests that the block Kaczmarz method of- 
fers a decisive improvement over the simple Kaczmarz method. 

First, we consider a 300 x 100 real matrix A rc im whose rows are drawn independently and uniformly at 
random from the £2 unit sphere. We partition the rows of this matrix into 10 equal blocks of 30 consecu- 
tive rows each. A heuristic application of the Bai-Yin Law |BY93, Thm. 2] shows that the singular values 
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of each 30 x 100 submatrix fall in the range 1 + V30/100. Thus, the paving parameters a « 0.2 and /3 « 2.4, 
and the condition number of each block is bounded by fi/a ~ 4.8. 



We perform an experiment with this matrix that follows the same methodology as in Section 4.3.2 



To implement the block Kaczmarz update rule (1.5) , we use the CGLS algorithm to solve the underde- 
termined least-squares problem. We use warm starts, and we halt the least-squares iteration before it 
has fully converged to control the computational cost. Our code counts the actual number of (real) flops 
during each iteration, and it measures the actual CPU time that is expended. 

Figure [2] shows three different views of the data we collected during this experiment. The left panel 
shows the approximation error for the simple and block Kaczmarz method as a function of the number of 



epochs elapsed. As we expect from the discussion in Section [T.6.2| the two algorithms show a similar rate 
of convergence in this view. In fact, the block method does somewhat better than the theory predicts. 
In the center panel, we chart the approximation error as a function of the number of flops performed. 
The simple algorithm requires about 5 x 10 6 flops to achieve an error of 10 -11 , while the block algorithm 
takes 2 x 10 7 flops to reach the same point. Thus, the block method needs four times as much arithmetic. 
But the story is not over. The right panel displays the approximation error as a function of CPU time. 
We discover that our implementation of the block method is 10 times faster than the simple method! It 
is dangerous to draw conclusions about the efficiency of an algorithm from the behavior of a Matlab 
script. Nevertheless, we regard this experiment as limited evidence of the computational advantages of 
the block method that we outlined in Section lij] 

Finally, we consider one other type of unstruc- 
tured matrix, where our analysis offers little guid- 
ance. Let A C0 h be a 300 x 100 matrix whose entries 
are drawn independently at random from the uni- 
form distribution on [0.5, 1]. We form a standard- 
ized matrix A co h by normalizing the rows of A co h. 
Since this matrix is dense and positive, the rows 
of this matrix tend to be strongly correlated with 
each other; the maximum inner product between 
rows is typically 0.98 or higher. Furthermore, the 
norm of this matrix is usually close to the maxi- 
mum possible value V300. Our theory provides 
no map for this terra incognita. Yet we brazenly 
partition the rows into 10 equal blocks of 30 rows 
each, as in the previous example. 

We perform the same experiment for A co h as 
we did with A X( \ m to compare the behavior of 
the simple Kaczmarz method and the random- 
ized Kaczmarz method. Figure [3] shows the re- 
sults of this trial. We see that the simple Kaczmarz 
method scarcely reduces the error at all, while the 
block method achieves a healthy rate of conver- 
gence. The paper INW12I provides an analysis of 
this example in the case where each block con- 
tains two rows, but we do not yet have a complete 
explanation for the performance of Algorithm 1 1.1 1 
when the blocks are larger. 
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Figure 3 (Convergence with a Coherent Matrix). The ma- 
trix A co h is a fixed 300 x 100 matrix with strongly corre- 
lated rows that are partitioned into 10 blocks of equal 
size. We compare the approximation error \\xj - jc* H2 for 
the iterates generated by each algorithm as a function of 
the number of flops performed. The heavy line denotes 
the median error over 100 trials; the minimum and maxi- 
mum error describe the boundaries of the shaded region, 
for details. 



See Section 



4.3.2 
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5. Related Work and Future Directions 



We conclude with a short discussion about recent research on randomized Kaczmarz methods and 
block variants of the simple Kaczmarz method. Finally, we offer some musings about opportunities for 
further research. 

5.1. Kaczmarz Methods with Randomized Control. The Kaczmarz method was originally introduced 
in the paper |Kac37 |. It was reinvented by researchers in tomography IGBH70I under the appellation 
"algebraic reconstruction technique" (ART). See Byrne's book |Byr08| for a contemporary summary of 
this literature. 

The classical variants of the Kaczmarz method rely on deterministic mechanisms for selecting a row at 
each iteration. Indeed, the simplest version just cycles through the rows in order. It has long been known 
that the cyclic control scheme performs badly when the rows are arranged in an unhappy order |HS78 |. 
The literature contains empirical evidence |HM93 | that randomized control mechanisms may be more 
effective, but until recently there was no compelling theoretical analysis to support this observation. 

The paper |SV09 | of Strohmer and Vershynin is significant because it provides the first explicit con- 
vergence proof for a randomized variant of the Kaczmarz algorithm. This work establishes that a ran- 
domized control scheme leads to an expected linear convergence rate, which can be written in terms of 
geometric properties of the matrix. In contrast, deterministic convergence analyses that appear in the 
literature often lead to expressions, e.g., |XZ02, Eqn. (1.2)], whose geometric meaning is not evident. 

In the wake of Strohmer and Vershyniris work |SV09 |, several other researchers have written about 
randomized versions of the Kaczmarz scheme and related topics. In particular, Needell demonstrates 
that the randomized Kaczmarz method converges, even when the linear system is inconsistent INeelOI . 
Zousias and Freris IZF12I exhibit a randomized procedure, based on ideas from |Pop98 |, that can reduce 
the size of the residual e. Leventhal and Lewis ILL101 provide an analysis of a randomized iteration for 
solving least-squares problems with polyhedral constraints, while Richtarik and Takac have extended 



these ideas to more general optimization problems IRT111 . Some other references include lE NTTllRrTTI 

[CPT^[NWl2l . 

5.2. Block Kaczmarz Methods. The block Kaczmarz update rule [1.5) we are studying is originally due 
to Elfving IElf80l Eqn. (2.2)]. This update is a special case of a general framework due to Eggermont et 
al. |EHL81 1. Byrne describes a number of other block Kaczmarz methods in his book |Byr08, Chap. 9]. 

By now, there is an extensive literature on the convergence behavior of block projection methods, 
a class of algorithms that includes block Kaczmarz schemes. In particular, we call out the work of Xu 
and Zikatanov |XZ02|, which contains a refined convergence analysis that applies to a wide range of 
algorithms. Nevertheless, to our knowledge, Algorithm |l.l| is the only block Kaczmarz method that offers 
an (expected) linear rate of convergence that depends explicitly on geometric properties of the system 
matrix A and its submatrices A T . 

Most of the literature on block Kaczmarz methods assumes that a partition T of the rows of the matrix 
is provided as part of the problem data. We are aware of some research on the prospects for partitioning 
a matrix in a manner that is favorable for block Kaczmarz methods. In particular, Popa |Pop99| has 
introduced an algorithm for partitioning a sparse matrix so that each block contains mutually orthogonal 
rows. Popa has pursued this idea in a sequence of papers, including |Pop01| |Pop04] . We believe that 
our work is the first to recognize the natural connection between the paving literature and the block 
Kaczmarz method. 

5.3. Some Future Directions. There are a number of interesting open questions connected with Algo- 
rithm [Li] First, empirical experiments make it clear that a control strategy based on sampling without 
replacement is far more effective than our strategy based on sampling with replacement. At present, 
there is no compelling explanation for this phenomenon (but see |RR12|). Second, we think that other 
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types of block Kaczmarz update rules |Byr08, Chap. 9] would also submit to a convergence analysis sim- 
ilar to the proof of Theorem |1.2| Third, it might be worthwhile to extend the argument of Zousias and 
Freris IZF12I to develop a method with a smaller convergence horizon. 

Block algorithms for numerical linear algebra are almost as old as the field itself. Gauss himself sug- 
gested a block version of his algorithm for solving linear systems IGau95llBen09l . Many other algorithms 
for numerical linear algebra IIGVL96I and least-squares problems |Bjo96| admit natural block variants. 
The field of optimization also contains a wide variety of block methods, such as | AC89 , Tse93 1 . 

We believe that row pavings can also play a role in the development and analysis of other types of 
block algorithms. For instance, it is straightforward to develop a block version of the randomized Gauss- 
Seidel algorithm introduced in |LL10 |. Using the techniques in this paper, we can easily bound the rate 
of convergence for this algorithm in terms of the properties of a column paving of the system matrix. It 
would be interesting to pursue this example and others. 



Appendix A. The Fast Incoherence Transform 



The goal of this Appendix is to prove Proposition |3.7| which states that the fast incoherence transform 
makes a standardized matrix incoherent with high probability. To that end, suppose that A is a matrix 
with n unit-norm rows, denoted a\>... y a n . Recall that the fast incoherence transform is the matrix S = 
FE, where F is the n x n unitary DFT and E is a diagonal matrix whose entries £ i , . . . , t; n are independent 
Rademacher random variables. 

Consider the matrix W := SA, and introduce the Gram matrix of its rows G := WW* = SAA*S*. Ex- 
panding this product, we find that the [i,£) entry of G takes the form 



gi£ = Jljk^jtk'{ijfek'(aj> a>k)> 



where we write fij for the (z\ j) entry of the DFT matrix. (We use the convention that the inner product is 
antilinear in the second coordinate.) This expression allows us to calculate the expectation of the Gram 
matrix with ease. Since the rows f i , . . . , f n of the DFT form an ortho normal family, 



Y = 



where Si£ is the Kronecker delta. Note that we have invoked the standardization assumption here. 

The real content of the argument is to obtain a bound on how much the entries of the Gram matrix 
deviate from their expected values. Fix a pair [i y £) of indices, not necessarily distinct, and define the 
random variable 

Y:=\git-8it\. 
We can rewrite Y is a more symmetric manner: 

YtjtkZjtk'yjk wh ere y]k-=^[Ujhk + iikfi])(a], «*)• (A.1) 

We see that the random variable Y is a symmetric, homogeneous, second-order Rademacher chaos. We 
can bound the probability that Y is large by invoking a result of Hanson and Wright |HW7l); see IFR121 
Chap. 8] for a modern proof. 

Proposition A.1 (Hanson-Wright). Consider the chaos variable Y defined in (A.!) . Let Y be the Hermitian 
matrix whose (j, k) entry is yjjc and whose diagonal entries are zero. Then 

P{Y> r}<2expi-c hw -mini^— ,— ^-tl fort>0. 

y \ \ \\y\\ || r n| j j J 

The number Ch w is a positive, universal constant. 
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To apply this result, we need to obtain bounds for the Frobenius norm and spectral norm of the matrix 
Y. Note that 

Y = i(Z + Z*) where Z := diag(f/) • (AA* - 1) • diag(fy)*, 

and diag(-) converts a vector into a diagonal matrix. We may now calculate the required norms of Y. 
First, 

|| Fll < ||Z|| < ||diag(f/)|| • II AA* -I|| • ||diag(f,)|| = - 1| AA* -I|| < -||A|| 2 . (A.2) 

n n 

The first inequality depends on the convexity of the spectral norm and its invariance under the conjugate 
transpose. The third relation holds because the entries of the vectors f; and all have magnitude n~ l/2 . 
The last inequality follows because || AA* -I|| <max{||A|| 2 - 1,1} < ||A|| 2 . For similar reasons, 

||F|| 2 < ||Z|| 2 = || AA* -III 2 < ^||AA*|| 2 < -||A|| 2 . (A.3) 

The strict inequality holds because AA* has a unit diagonal, which the identity matrix cancels off. The 
final bound follows from the interpolation ||P|| 2 < trace (P) ||P||, valid when P is positive semidefinite. 



Next, let us instate the hypothesis (3JJ from Proposition 3.7 



o Cfit • n 
\A\\ 2 < 



log 3 (l + rc)' 

Introduce this assumption into the bounds flA.2) and HA.3) , and apply the Hanson-Wright inequality, 
Proposition |A.l| to reach 

P{Y> r}<2expi-— -log^l + ^-minlr, t 2 }\ . 

I Cfit J 

We may set the constant Cfi t := Ch w Cmc 2 /3. For the choice t = Ci nc / log(l + n), we discover that 

-3 



, l y Cine 1 < 2 (l + fi)" 

1 log(l + J 



In other words, it is unlikely that any single pair of rows from W = SA has an inner product much different 
from its expectation. 

To complete the argument, we unfix the pair [i,£) of indices. Forming a union bound over all n{n+ 1) 12 
choices where i < £, we conclude that 



Jmaxlgtf-tftfl > — l<(l + f7) _1 . 

1 i,£ in log(l + rt)J 



Therefore, with high probability W is an incoherent matrix that is nearly standardized. 
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